VARIABLE SELECTION IN SEMIPARAMETRIC
In this paper, we are concerned with how to select significant variables
in semiparametric modeling. Variable selection for semiparametric
regression models consists of two components: model selection
for nonparametric components and selection of significant variables 
for the parametric portion. Thus, semiparametric variable selection is 
much more challenging than parametric variable selection (e.g., linear 
and generalized linear models) because traditional variable selection 
procedures including stepwise regression and the best subset selection 
now require separate model selection for the nonparametric compo- 
nents for each submodel. This leads to a very heavy computational 
burden. In this paper, we propose a class of variable selection proce- 
dures for semiparametric regression models using nonconcave penal- 
ized likelihood. We establish the rate of convergence of the resulting 
estimate. With proper choices of penalty functions and regulariza- 
tion parameters, we show the asymptotic normality of the resulting 
estimate and further demonstrate that the proposed procedures per- 
form as well as an oracle procedure. A semiparametric generalized 
likelihood ratio test is proposed to select significant variables in the 
nonparametric component. We investigate the asymptotic behavior 
of the proposed test and demonstrate that its limiting null distri- 
bution follows a chi-square distribution which is independent of the 
nuisance parameters. Extensive Monte Carlo simulation studies are 
conducted to examine the finite sample performance of the proposed 
variable selection procedures. 

1. Introduction. Semiparametric regression models retain the virtues of 
both parametric and nonparametric modeling. Härdle, Liang and Gao [13],
Ruppert, Wand and Carroll [21] and Yatchew [26] present diverse
semiparametric regression models along with their inference procedures and
applications. The goal of this paper is to develop effective model selection
procedures for a class of semiparametric regression models. Let $Y$
be a response variable and $\{U, X, Z\}$ its associated covariates. Further, let
$\mu(u, x, z) = E(Y \mid U = u, X = x, Z = z)$. The generalized varying-coefficient
partially linear model (GVCPLM) assumes that

$$g\{\mu(u, x, z)\} = x^T \alpha(u) + z^T \beta, \tag{1.1}$$

where $g(\cdot)$ is a known link function, $\beta$ is a vector of unknown regression
coefficients and $\alpha(\cdot)$ is a vector consisting of unspecified smooth regression
coefficient functions. Model (1.1) is a semiparametric model: $z^T \beta$ is referred
to as the parametric component and $x^T \alpha(u)$ as the nonparametric component,
since $\alpha(\cdot)$ is nonparametric. This semiparametric model retains the
flexibility of a nonparametric regression model and has the explanatory power of
a generalized linear regression model. Many existing semiparametric or
nonparametric regression models are special cases of model (1.1). For instance,
partially linear models (see, e.g., Hardle, Liang and Gao [13] and references 
therein), generalized partially linear models (Severini and Staniswalis [23] 
and Hunsberger [15]), semivarying-coefficient models (Zhang, Lee and Song 
[27], Xia, Zhang and Tong [25] and Fan and Huang [8]) and varying coef- 
ficient models (Hastie and Tibshirani [14] and Cai, Fan and Li [4]) can be 
written in the form of (1.1). Thus, the newly proposed procedures provide 
a general framework of model selection for these existing models. 

Variable selection is fundamental in statistical modeling. In practice, a 
number of variables are available for inclusion in an initial analysis, but 
many of them may not be significant and should be excluded from the fi- 
nal model in order to increase the accuracy of prediction. Variable selection 
for the GVCPLM is challenging in that it includes selection of significant 
variables in the nonparametric component as well as identification of signif- 
icant variables in the parametric component. Traditional variable selection 
procedures such as stepwise regression and the best subset variable selection 
for linear models may be extended to the GVCPLM, but this poses great 
challenges because for each submodel, it is necessary to choose smoothing pa- 
rameters for the nonparametric component. This will dramatically increase 
the computational burden. In an attempt to simultaneously select signifi- 
cant variables and estimate unknown regression coefficients, Fan and Li [9] 
proposed a family of variable selection procedures for parametric models via 
nonconcave penalized likelihood. For linear regression models, this family in- 
cludes bridge regression (Frank and Friedman [12]) and LASSO (Tibshirani 
[24]). It has been demonstrated that with proper choice of penalty function 
and regularization parameters, the nonconcave penalized likelihood estima- 
tor performs as well as an oracle estimator (Fan and Li [9]). This encourages 
us to adopt this methodology for semiparametric regression models. In this
paper, we propose a class of variable selection procedures for the paramet- 
ric component of the GVCPLM. We also study the asymptotic properties 
of the resulting estimator. We illustrate how the rate of convergence of the 
resulting estimate depends on the regularization parameters and further es- 
tablish the oracle properties of the resulting estimate. To select significant 
variables in the nonparametric component of the GVCPLM, we extend gen- 
eralized likelihood ratio tests (GLRT, Fan, Zhang and Zhang [10]) from 
fully nonparametric models to semiparametric models. We show the Wilks 
phenomenon for model (1.1): the limiting null distribution of the proposed 
GLRT does not depend on the unknown nuisance parameter and it follows 
a chi-square distribution with diverging degrees of freedom. This allows us 
to easily obtain critical values for the GLRT using either the asymptotic 
chi-square distribution or a bootstrap method. 

The paper is organized as follows. In Section 2, we first propose a class of 
variable selection procedures for the parametric component via the noncon- 
cave penalized likelihood approach and then study the sampling properties 
of the proposed procedures. In Section 3, variable selection procedures are 
proposed for the nonparametric component using GLRT. The limiting null 
distribution of the GLRT is derived. Monte Carlo studies and an applica- 
tion involving real data are presented in Section 4. Regularity conditions 
and technical proofs are presented in Section 5. 

2. Selection of significant variables in the parametric component. Suppose
that $\{U_i, X_i, Z_i, Y_i\}$, $i = 1, \ldots, n$, constitute an independent and
identically distributed sample and that, conditionally on $\{U_i, X_i, Z_i\}$, the
conditional quasi-likelihood of $Y_i$ is $Q\{\mu(U_i, X_i, Z_i), Y_i\}$, where the
quasi-likelihood function is defined by

$$Q(\mu, y) = \int_{\mu}^{y} \frac{s - y}{V(s)}\, ds$$

for a specific variance function $V(s)$. Throughout this paper, $X_i$ is
$p$-dimensional, $Z_i$ is $d$-dimensional and $U$ is univariate. The methods can
be extended for multivariate $U$ in a similar way without any essential
difficulty. However, the extension may not be very useful in practice due to the
"curse of dimensionality."

2.1. Penalized likelihood. Denote by $\ell(\alpha, \beta)$ the quasi-likelihood of the
collected data $\{(U_i, X_i, Z_i, Y_i), i = 1, \ldots, n\}$. That is,

$$\ell(\alpha, \beta) = \sum_{i=1}^{n} Q[g^{-1}\{X_i^T \alpha(U_i) + Z_i^T \beta\}, Y_i].$$






R. LI AND H. LIANG 



Following Fan and Li [9], define the penalized quasi-likelihood as 



$$\mathcal{L}(\alpha, \beta) = \ell(\alpha, \beta) - n \sum_{j=1}^{d} p_{\lambda_j}(|\beta_j|), \tag{2.1}$$

where $p_{\lambda_j}(\cdot)$ is a prespecified penalty function with a regularization
parameter $\lambda_j$, which can be chosen by a data-driven criterion such as
cross-validation (CV) or generalized cross-validation (GCV, Craven and Wahba
[6]). Note that the penalty functions and regularization parameters are not 
necessarily the same for all j. For example, we wish to keep some impor- 
tant variables in the final model and therefore do not want to penalize their 
coefficients. 

Before we pursue this further, let us briefly discuss how to select the 
penalty functions. Various penalty functions have been used in the literature
on variable selection for linear regression models. Let us take the
penalty function to be the $L_0$ penalty, namely, $p_{\lambda_j}(|\beta|) = 0.5\lambda_j^2 I(|\beta| \neq 0)$,
where $I(\cdot)$ is the indicator function. Note that $\sum_{j=1}^{d} I(|\beta_j| \neq 0)$ equals the
number of nonzero regression coefficients in the model. Hence, many popular
variable selection criteria such as AIC (Akaike [1]), BIC (Schwarz [22]) and
RIC (Foster and George [11]) can be derived from a penalized least squares
problem with the $L_0$ penalty by choosing different values of $\lambda_j$, even though
these criteria were motivated by different principles. Since the $L_0$ penalty
is discontinuous, it requires an exhaustive search of all possible subsets of 
predictors to find the solution. This approach is very expensive in compu- 
tational cost when the dimension d is large. Furthermore, the best subset 
variable selection suffers from other drawbacks, the most severe of which is 
its lack of stability, as analyzed, for instance, by Breiman [3]. 

To avoid the drawbacks of the best subset selection, that is, expensive 
computational cost and the lack of stability, Tibshirani [24] proposed the 
LASSO, which can be viewed as the solution of the penalized least squares 
problem with the $L_1$ penalty, defined by $p_{\lambda_j}(|\beta|) = \lambda_j|\beta|$. Frank and Friedman
[12] considered the $L_q$ penalty, $p_{\lambda_j}(|\beta|) = \lambda_j|\beta|^q$, $0 < q < 1$, which yields
a "bridge regression." The issue of the selection of the penalty function has
been studied in depth by various authors, for instance, Antoniadis and Fan
[2]. Fan and Li [9] suggested using the smoothly clipped absolute deviation
(SCAD) penalty, defined by

$$p'_{\lambda_j}(\beta) = \lambda_j \left\{ I(\beta \le \lambda_j) + \frac{(a\lambda_j - \beta)_+}{(a - 1)\lambda_j} I(\beta > \lambda_j) \right\} \quad \text{for some } a > 2 \text{ and } \beta > 0,$$

with $p_{\lambda_j}(0) = 0$. This penalty function involves two unknown parameters,
$\lambda_j$ and $a$. Arguing from a Bayesian statistical point of view, Fan and Li [9]
suggested using $a = 3.7$. This value will be used in Section 4.
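As an illustration, the SCAD penalty of Fan and Li [9] can be evaluated numerically. The following is a minimal sketch assuming NumPy; the function names `scad_derivative` and `scad_penalty` are hypothetical, and the closed-form pieces of the penalty are obtained here by integrating the derivative from 0 with $p_{\lambda}(0) = 0$.

```python
import numpy as np


def scad_derivative(beta, lam, a=3.7):
    """Derivative p'_lam(|beta|) of the SCAD penalty.

    p'_lam(b) = lam * { I(b <= lam) + (a*lam - b)_+ / ((a - 1)*lam) * I(b > lam) }.
    """
    b = np.abs(np.asarray(beta, dtype=float))
    inner = (b <= lam).astype(float)
    outer = np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam)
    return lam * (inner + outer)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lam(|beta|), the integral of scad_derivative from 0."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                            # 0 <= b <= lam
    quad = (2.0 * a * lam * b - b**2 - lam**2) / (2.0 * (a - 1.0))  # lam < b <= a*lam
    const = (a + 1.0) * lam**2 / 2.0                            # b > a*lam
    return np.where(b <= lam, linear, np.where(b <= a * lam, quad, const))
```

The penalty behaves like the $L_1$ penalty near the origin and is constant beyond $a\lambda$, so large coefficients are not shrunk.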
Since $\alpha(\cdot)$ consists of nonparametric functions, (2.1) is not yet ready
for optimization. We must first use local likelihood techniques (Fan and
Gijbels [7]) to estimate $\alpha(\cdot)$, then substitute the resulting estimate into
(2.1) and finally maximize (2.1) with respect to $\beta$. We can thus obtain a
penalized likelihood estimate for $\beta$. With specific choices of penalty function,
the resulting estimate of $\beta$ will contain some exact zero coefficients. This
is equivalent to excluding the corresponding variables from the final model. 
We thus achieve the objective of variable selection. 

Specifically, we linearly approximate $\alpha_j(v)$ for $v$ in a neighborhood of $u$
by

$$\alpha_j(v) \approx \alpha_j(u) + \alpha_j'(u)(v - u) \equiv a_j + b_j(v - u).$$

Let $a = (a_1, \ldots, a_p)^T$ and $b = (b_1, \ldots, b_p)^T$. The local likelihood method is
to maximize the local likelihood function

$$\sum_{i=1}^{n} Q[g^{-1}\{a^T X_i + b^T X_i(U_i - u) + Z_i^T \beta\}, Y_i] K_h(U_i - u) \tag{2.2}$$

with respect to $a$, $b$ and $\beta$, where $K(\cdot)$ is a kernel function and
$K_h(t) = h^{-1}K(t/h)$ is a rescaling of $K$ with bandwidth $h$. Let $\{\tilde a, \tilde b, \tilde\beta\}$ be the
solution of maximizing (2.2). Then

$$\tilde\alpha(u) = \tilde a.$$

As demonstrated in Lemma 2, $\tilde\alpha$ is $\sqrt{nh}$-consistent, but its efficiency can
be improved by the estimator proposed in Section 3.1. As $\beta$ was estimated
locally, the resulting estimate $\tilde\beta$ does not have a root-$n$ convergence rate. To
improve efficiency, $\beta$ should be estimated using global likelihood.
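To make the local likelihood step concrete, the sketch below maximizes (2.2) at a single point $u$ in the simplest special case: identity link and $V(s) = 1$, for which the local quasi-likelihood reduces to kernel-weighted least squares. This special case, the Epanechnikov kernel and the function name `local_linear_fit` are assumptions made for illustration; a general link would require an iterative (Newton-Raphson type) solver instead.

```python
import numpy as np


def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+."""
    return 0.75 * np.maximum(1.0 - t**2, 0.0)


def local_linear_fit(u, U, X, Z, Y, h):
    """Maximize (2.2) at the point u for the identity link and V(s) = 1.

    Returns the local estimates (a, b, beta), where a estimates alpha(u)
    and b estimates alpha'(u).
    """
    w = epanechnikov((U - u) / h) / h                 # K_h(U_i - u)
    design = np.hstack([X, X * (U - u)[:, None], Z])  # columns for a, b, beta
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], Y * sw, rcond=None)
    p = X.shape[1]
    return coef[:p], coef[p:2 * p], coef[2 * p:]
```

Only the first block of coefficients is retained as the estimate of $\alpha(u)$; the locally estimated $\beta$ is discarded in favor of a global estimate.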

Replacing $\alpha$ in (2.1) by its estimate $\tilde\alpha$, we obtain the penalized likelihood

$$\mathcal{L}_P(\beta) = \sum_{i=1}^{n} Q\{g^{-1}(X_i^T \tilde\alpha(U_i) + Z_i^T \beta), Y_i\} - n \sum_{j=1}^{d} p_{\lambda_j}(|\beta_j|). \tag{2.3}$$

Maximizing $\mathcal{L}_P(\beta)$ results in a penalized likelihood estimator $\hat\beta$. The
proposed approach is in the same spirit as the one-step backfitting algorithm
estimate, although one may further employ the backfitting algorithm
with a full iteration or a profile likelihood approach to improve efficiency. The
next theorem demonstrates that $\hat\beta$ performs as well as an oracle estimator
in an asymptotic sense. Compared with fully iterated backfitting algorithms 
and profile likelihood estimation, the newly proposed method is much less 
computationally costly and is easily implemented. For high-dimensional X- 
and Z-variables, the Hessian matrix of the local likelihood function (2.2) 
may be nearly singular. To make the resulting estimate stable, one may ap- 
ply the idea of ridge regression to the local likelihood function. See Cai, Fan 
and Li [4] for a detailed implementation of ridge regression to local likelihood 
modeling. 
2.2. Sampling properties. We next study the asymptotic properties of
the resulting penalized likelihood estimate. We first introduce some notation.
Let $\alpha_0(\cdot)$ and $\beta_0$ denote the true values of $\alpha(\cdot)$ and $\beta$, respectively.
Furthermore, let $\beta_0 = (\beta_{10}, \ldots, \beta_{d0})^T = (\boldsymbol{\beta}_{10}^T, \boldsymbol{\beta}_{20}^T)^T$. For ease of presentation
and without loss of generality, it is assumed that $\boldsymbol{\beta}_{10}$ consists of all nonzero
components of $\beta_0$ and that $\boldsymbol{\beta}_{20} = 0$. Let

$$a_n = \max_{1 \le j \le d} \{|p'_{\lambda_j}(|\beta_{j0}|)|, \beta_{j0} \neq 0\} \tag{2.4}$$

and

$$b_n = \max_{1 \le j \le d} \{|p''_{\lambda_j}(|\beta_{j0}|)|, \beta_{j0} \neq 0\}.$$

Theorem 1. Under the regularity conditions given in Section 5, if $a_n \to 0$,
$b_n \to 0$, $nh^4 \to 0$ and $nh^2/\log(1/h) \to \infty$ as $n \to \infty$, then there exists a
local maximizer $\hat\beta$ of $\mathcal{L}_P(\beta)$ defined in (2.3) such that its rate of convergence
is $O_P(n^{-1/2} + a_n)$, where $a_n$ is given in (2.4).

We require further notation in order to present the oracle properties of 
the resulting penalized likelihood estimate. Define 

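$$\mathbf{b}_n = \{p'_{\lambda_1}(|\beta_{10}|)\,\mathrm{sgn}(\beta_{10}), \ldots, p'_{\lambda_s}(|\beta_{s0}|)\,\mathrm{sgn}(\beta_{s0})\}^T$$

and

$$\Sigma_\lambda = \mathrm{diag}\{p''_{\lambda_1}(|\beta_{10}|), \ldots, p''_{\lambda_s}(|\beta_{s0}|)\},$$

where $s$ is the number of nonzero components of $\beta_0$. Let $\mu_j = \int t^j K(t)\,dt$
and $\nu_j = \int t^j K^2(t)\,dt$ for $j = 0, 1, 2$. Define

$$\rho_l(t) = \frac{\{dg^{-1}(t)/dt\}^l}{V\{g^{-1}(t)\}}$$

and let $q_1(x, y) = \rho_1(x)\{y - g^{-1}(x)\}$. Let $R = \alpha_0^T(U)X + Z_1^T \boldsymbol{\beta}_{10}$ and

$$\Sigma(u) = E\left[\rho_2(R) \begin{pmatrix} XX^T & XZ_1^T \\ Z_1X^T & Z_1Z_1^T \end{pmatrix} \,\middle|\, U = u\right]. \tag{2.5}$$

Denote by $\kappa_k$ the $k$th element of $q_1(R, Y)\Sigma^{-1}(u)(X^T, Z_1^T)^T$ and define

$$\Gamma_1(u) = \sum_{k=1}^{p} \kappa_k E[\rho_2(R) X_k Z_1 \mid U = u].$$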

Theorem 2. Suppose that the regularity conditions given in Section 5
hold and that

$$\liminf_{n \to \infty} \liminf_{\beta_j \to 0+} \lambda_{jn}^{-1} p'_{\lambda_{jn}}(|\beta_j|) > 0.$$

If $\sqrt{n}\lambda_{jn} \to \infty$, $nh^4 \to 0$ and $nh^2/\log(1/h) \to \infty$ as $n \to \infty$, then the root-$n$
consistent estimator $\hat\beta$ in Theorem 1 must satisfy $\hat{\boldsymbol{\beta}}_2 = 0$ and

$$\sqrt{n}(B_1 + \Sigma_\lambda)\{\hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_{10} + (B_1 + \Sigma_\lambda)^{-1}\mathbf{b}_n\} \xrightarrow{D} N(0, \Sigma),$$

where $B_1 = E[\rho_2(R)Z_1Z_1^T]$ and $\Sigma = \mathrm{var}\{q_1(R, Y)Z_1 - \Gamma_1(U)\}$.

Theorem 2 indicates that undersmoothing is necessary in order for $\hat\beta$ to
have root-n consistency and asymptotic normality. This is a standard result 
in generalized partially linear models; see Carroll et al. [5] for a detailed 
discussion. Thus, special care is needed for bandwidth selection, as discussed 
in Section 3.1. 

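2.3. Issues arising in practical implementation. Many penalty functions
$p_{\lambda_j}(|\beta_j|)$, including the $L_1$ penalty and the SCAD penalty, are irregular at
the origin and may not have a second derivative at some points. Direct
implementation of the Newton-Raphson algorithm may be difficult. Following
Fan and Li [9], we locally approximate the penalty function by a quadratic
function at every step of the iteration, as follows. Given an initial value $\beta^{(0)}$
that is close to the maximizer of the penalized likelihood function, when $\beta_j^{(0)}$
is not very close to 0, the penalty $p_{\lambda_j}(|\beta_j|)$ can be locally approximated by
the quadratic function as

$$[p_{\lambda_j}(|\beta_j|)]' = p'_{\lambda_j}(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \{p'_{\lambda_j}(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}\beta_j.$$

Otherwise, set $\hat\beta_j = 0$. In other words, we have

$$p_{\lambda_j}(|\beta_j|) \approx p_{\lambda_j}(|\beta_j^{(0)}|) + \tfrac{1}{2}\{p'_{\lambda_j}(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}(\beta_j^2 - \beta_j^{(0)2}).$$

For instance, this local quadratic approximation for the $L_1$ penalty yields

$$|\beta_j| \approx \tfrac{1}{2}|\beta_j^{(0)}| + \tfrac{1}{2}\frac{\beta_j^2}{|\beta_j^{(0)}|} \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$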

With the aid of the local quadratic approximation, the Newton-Raphson 
algorithm can be modified to search for the solution of the penalized likeli- 
hood. The convergence of the modified Newton-Raphson algorithm for other 
statistical settings has been studied by Hunter and Li [16]. 
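The local quadratic approximation can be sketched in code for the simplest case, a linear model with identity link and $V(s) = 1$, where each modified Newton-Raphson step has the closed form $\beta^{(1)} = \{Z^T Z + n\Sigma_\lambda(\beta^{(0)})\}^{-1} Z^T y$ with $\Sigma_\lambda(\beta) = \mathrm{diag}\{p'_{\lambda}(|\beta_j|)/|\beta_j|\}$. The restriction to this Gaussian case, the SCAD penalty with a common $\lambda$, the thresholding constant `zero_tol` and the function names are assumptions made for illustration only.

```python
import numpy as np


def scad_derivative(b, lam, a=3.7):
    """SCAD derivative p'_lam(|b|)."""
    b = np.abs(b)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam))


def lqa_penalized_ls(Z, y, lam, a=3.7, tol=1e-8, zero_tol=1e-4, max_iter=500):
    """Penalized least squares via iterated local quadratic approximation."""
    n, d = Z.shape
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]   # unpenalized initial value
    active = np.ones(d, dtype=bool)
    for _ in range(max_iter):
        # Coefficients very close to 0 are set to 0 and dropped from the model.
        active &= np.abs(beta) > zero_tol
        beta[~active] = 0.0
        if not active.any():
            break
        Za, ba = Z[:, active], beta[active]
        sigma = np.diag(scad_derivative(ba, lam, a) / np.abs(ba))
        new = np.linalg.solve(Za.T @ Za + n * sigma, Za.T @ y)
        converged = np.max(np.abs(new - ba)) < tol
        beta[active] = new
        if converged:
            break
    beta[np.abs(beta) <= zero_tol] = 0.0
    return beta
```

Once a coefficient is set to zero it stays at zero, which is a known feature of the local quadratic approximation.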

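Standard error formula for $\hat\beta$. The standard errors for estimated
parameters can be obtained directly because we are estimating parameters and
selecting variables at the same time. Following the conventional technique
in the likelihood setting, the corresponding sandwich formula can be used
as an estimator for the covariance matrix of the estimates $\hat\beta$. Specifically,
let

$$\ell'(\beta) = \frac{\partial \ell(\tilde\alpha, \beta)}{\partial \beta}, \qquad \ell''(\beta) = \frac{\partial^2 \ell(\tilde\alpha, \beta)}{\partial \beta\, \partial \beta^T}$$

and

$$\Sigma_\lambda(\beta) = \mathrm{diag}\left\{\frac{p'_{\lambda_1}(|\beta_1|)}{|\beta_1|}, \ldots, \frac{p'_{\lambda_d}(|\beta_d|)}{|\beta_d|}\right\}.$$

The corresponding sandwich formula is then given by

$$\widehat{\mathrm{cov}}(\hat\beta) = \{\ell''(\hat\beta) - n\Sigma_\lambda(\hat\beta)\}^{-1}\, \widehat{\mathrm{cov}}\{\ell'(\hat\beta)\}\, \{\ell''(\hat\beta) - n\Sigma_\lambda(\hat\beta)\}^{-1}.$$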

This formula can be shown to be a consistent estimator and will be shown 
to have good accuracy for moderate sample sizes. 

Choice of $\lambda_j$'s. We suggest selecting the tuning parameters $\lambda_j$ using
data-driven approaches. Similarly to Fan and Li [9], we will employ
generalized cross-validation (GCV) to select the $\lambda_j$'s. In the last step of the
Newton-Raphson iteration, we may compute the number of effective
parameters:

$$e(\lambda_1, \ldots, \lambda_d) = \mathrm{tr}[\{\ell''(\hat\beta) - n\Sigma_\lambda(\hat\beta)\}^{-1}\ell''(\hat\beta)].$$

The GCV statistic is defined by

$$\mathrm{GCV}(\lambda_1, \ldots, \lambda_d) = \frac{\sum_{i=1}^{n} D\{Y_i, g^{-1}(X_i^T \tilde\alpha(U_i) + Z_i^T \hat\beta(\lambda))\}}{n\{1 - e(\lambda_1, \ldots, \lambda_d)/n\}^2},$$

where $D\{Y, \mu\}$ denotes the deviance of $Y$ corresponding to the model fit
with $\lambda$. The minimization problem over a $d$-dimensional space is difficult.
However, it is expected that the magnitude of $\lambda_j$ should be proportional to
the standard error of the unpenalized maximum pseudo-partial likelihood
estimator of $\beta_j$. In practice, we suggest taking $\lambda_j = \lambda\,\mathrm{SE}(\hat\beta_j^u)$, where $\mathrm{SE}(\hat\beta_j^u)$
is the estimated standard error of $\hat\beta_j^u$, the unpenalized likelihood estimate.
Such a choice of $\lambda_j$ works well from our simulation experience. The
minimization problem will thus reduce to a one-dimensional problem and the
tuning parameter can be estimated by means of a grid search. 
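For a linear model with identity link and $V(s) = 1$, $\ell''(\beta) = -Z^T Z$ and the deviance is the residual sum of squares, so the number of effective parameters and the GCV statistic can be computed directly. The sketch below makes these simplifying assumptions; the function name `gcv_score` is hypothetical, and restricting the trace to the nonzero coefficients is a convention assumed here.

```python
import numpy as np


def gcv_score(Z, y, beta, sigma_lambda_diag):
    """GCV statistic and effective number of parameters for a penalized fit.

    Z, y              : design matrix and response of a linear model
    beta              : penalized estimate (exact zeros mark removed variables)
    sigma_lambda_diag : entries p'_{lambda_j}(|beta_j|) / |beta_j|, one per
                        coefficient (entries of removed variables are ignored)
    """
    n = Z.shape[0]
    active = beta != 0
    if active.any():
        Za = Z[:, active]
        gram = Za.T @ Za
        penalty = n * np.diag(np.asarray(sigma_lambda_diag, dtype=float)[active])
        # e = tr[{Z'Z + n Sigma_lambda}^{-1} Z'Z], restricted to active variables
        e = float(np.trace(np.linalg.solve(gram + penalty, gram)))
    else:
        e = 0.0
    rss = float(np.sum((y - Z @ beta) ** 2))
    return rss / (n * (1.0 - e / n) ** 2), e
```

In a grid search, one would evaluate `gcv_score` at the fit obtained for each candidate $\lambda$ and keep the value of $\lambda$ with the smallest score.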

3. Statistical inferences for nonparametric components. In this section, 
we will propose an estimation procedure for $\alpha(\cdot)$ and extend the generalized
likelihood ratio test from nonparametric models to model (1.1). 

3.1. Estimation of nonparametric component. Replacing $\beta$ in (2.2) by
its estimate $\hat\beta$, we maximize the local likelihood function

$$\sum_{i=1}^{n} Q[g^{-1}\{a^T X_i + b^T X_i(U_i - u) + Z_i^T \hat\beta\}, Y_i] K_h(U_i - u) \tag{3.1}$$

with respect to $a$ and $b$. Let $\{\hat a, \hat b\}$ be the solution of maximizing (3.1) and
let $\hat\alpha(u) = \hat a$. Similarly to Cai, Fan and Li [4], we can show that

$$(nh)^{1/2}\left\{\hat\alpha(u) - \alpha_0(u) - \frac{\mu_2}{2}\alpha_0''(u)h^2\right\} \xrightarrow{D} N\left\{0, \frac{\nu_0}{f(u)}\Sigma_*(u)\right\},$$

where $\Sigma_*(u) = (E[\rho_2\{\alpha_0^T(U)X + \beta_0^T Z\}XX^T \mid U = u])^{-1}$ and where $f(u)$ is
the density of $U$. Thus, $\hat\alpha(u)$ has conditional asymptotic bias $0.5h^2\mu_2\alpha_0''(u) + o_P(h^2)$
and conditional asymptotic covariance $(nh)^{-1}\nu_0\Sigma_*(u)f^{-1}(u) + o_P\{(nh)^{-1}\}$.
From Lemma 2, the asymptotic bias of $\hat\alpha$ is the same as that of $\tilde\alpha$, while
the asymptotic covariance of $\hat\alpha$ is smaller than that of $\tilde\alpha$.

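A theoretic optimal local bandwidth for estimating the elements of $\alpha(\cdot)$
can be obtained by minimizing the conditional mean squared error (MSE)
given by

$$E\{\|\hat\alpha(u) - \alpha_0(u)\|^2 \mid Z, X\} = \frac{1}{4}h^4\mu_2^2\|\alpha_0''(u)\|^2 + \frac{1}{nh}\frac{\nu_0\,\mathrm{tr}\{\Sigma_*(u)\}}{f(u)} + o_P\left(h^4 + \frac{1}{nh}\right),$$

where $\|\cdot\|$ is the Euclidean distance. Thus, the ideal choice of local
bandwidth is

$$h_{\mathrm{opt}} = \left[\frac{\nu_0\,\mathrm{tr}\{\Sigma_*(u)\}}{f(u)\mu_2^2\|\alpha_0''(u)\|^2}\right]^{1/5} n^{-1/5}.$$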

With expressions for the asymptotic bias and variance, we can also derive a 
theoretic or data-driven global bandwidth selector by utilizing the existing 
bandwidth selection techniques for the canonical univariate nonparametric 
model, such as the substitution method (see, e.g., Ruppert, Sheather and 
Wand [20]). For the sake of brevity, we omit the details here. 

As usual, the optimal bandwidth will be of order $n^{-1/5}$. This does not
satisfy the condition in Theorems 1 and 2. A good bandwidth is generally
generated by $h_{\mathrm{opt}} \times n^{-2/15} = O(n^{-1/3})$.

In order for the resulting variable selection procedures to possess an
oracle property, the bandwidth must satisfy the conditions $nh^4 \to 0$ and
$nh^2/(\log n)^2 \to \infty$. The aforementioned order of bandwidth satisfies these
requirements. This enables us to easily choose a bandwidth either by data- 
driven procedures or by an asymptotic theory-based method. 
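The two bandwidth conditions can be checked numerically for a bandwidth of order $n^{-1/3}$. The short sketch below is only an illustration: the constant `c` and the function name `bandwidth_conditions` are hypothetical, and with $h = cn^{-1/3}$ the two quantities are $c^4 n^{-1/3}$ and $c^2 n^{1/3}/(\log n)^2$.

```python
import math


def bandwidth_conditions(n, c=1.0):
    """Return (h, n*h^4, n*h^2/(log n)^2) for the bandwidth h = c * n^(-1/3)."""
    h = c * n ** (-1.0 / 3.0)
    return h, n * h**4, n * h**2 / math.log(n) ** 2


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9, 10**12):
        h, cond_bias, cond_var = bandwidth_conditions(n)
        print(f"n={n:>14d}  h={h:.5f}  n h^4={cond_bias:.5f}  n h^2/(log n)^2={cond_var:.3f}")
```

The first quantity decreases to 0 and the second grows without bound, although the growth of the second is slow because of the logarithmic factor.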

3.2. Variable selection for the nonparametric component. After obtaining
nonparametric estimates for $\{\alpha_1(\cdot), \ldots, \alpha_p(\cdot)\}$, it is of interest to select
significant $X$-variables. For linear regression models, one conducts an $F$-test
at each step of the traditional backward elimination or forward addition
procedures. The purpose of variable selection may be achieved by a sequence
of $F$-tests. Following the strategy of the traditional variable selection
procedure, one may apply the backward elimination procedure to select significant
variables. In each step of the backward elimination procedure, we essentially
test the following hypothesis

$$H_0: \alpha_{j_1}(u) = \cdots = \alpha_{j_k}(u) = 0 \quad \text{versus} \quad H_1: \alpha_{j_l}(u) \neq 0 \text{ for at least one } l$$

for some $\{j_1, \ldots, j_k\}$, a subset of $\{1, \ldots, p\}$. The purpose of variable selection
may be achieved by a sequence of such tests. For ease of presentation, we
here consider the following hypothesis:

$$H_0: \alpha_1(u) = \cdots = \alpha_p(u) = 0 \quad \text{versus} \quad H_1: \alpha_j(u) \neq 0 \text{ for at least one } j. \tag{3.2}$$

The proposed idea is also applicable to more general cases. 

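Let $\hat\alpha(u)$ and $\hat\beta$ be the estimators of $\alpha(u)$ and $\beta$ under the alternative
hypothesis, respectively, and let $\bar\beta$ be the estimator of $\beta$ under the null
hypothesis. Define

$$\mathcal{R}(H_1) = \sum_{i=1}^{n} Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T \hat\beta), Y_i\}$$

and

$$\mathcal{R}(H_0) = \sum_{i=1}^{n} Q\{g^{-1}(Z_i^T \bar\beta), Y_i\}.$$

Following Fan, Zhang and Zhang [10], we define a generalized quasi-likelihood
ratio test (GLRT) statistic

$$T_{\mathrm{GLR}} = r_K\{\mathcal{R}(H_1) - \mathcal{R}(H_0)\},$$

where

$$r_K = \left\{K(0) - 0.5\int K^2(u)\,du\right\}\left[\int\{K(u) - 0.5K * K(u)\}^2\,du\right]^{-1}.$$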



Theorem 3. Suppose that the regularity conditions given in Section 5
hold and that $nh^8 \to 0$ and $nh^2/(\log n)^2 \to \infty$. Under $H_0$ in (3.2), the test
statistic $T_{\mathrm{GLR}}$ has an asymptotic $\chi^2$ distribution with $df_n$ degrees of freedom,
in the sense of Fan, Zhang and Zhang [10], where
$df_n = r_K p|\Omega|\{K(0) - 0.5\int K^2(u)\,du\}/h$ and where $|\Omega|$ denotes the length of the support of $U$.

Theorem 3 reveals a new Wilks phenomenon for semiparametric infer- 
ence and extends the generalized likelihood ratio theory (Fan, Zhang and 
Zhang [10]) for semiparametric modeling. We will also provide empirical 
justification for the null distribution. Similarly to Cai, Fan and Li [4], the 
null distribution of $T_{\mathrm{GLR}}$ can be estimated using Monte Carlo simulation
or a bootstrap procedure. This usually provides a better estimate than the 
asymptotic null distribution since the degrees of freedom tend to infinity 
and the results in Fan, Zhang and Zhang [10] only give the main order of 
the degrees of freedom. 
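The normalizing constant $r_K$ and the degrees of freedom $df_n$ depend only on the kernel, $p$, $|\Omega|$ and $h$, so they can be evaluated numerically. The sketch below assumes the form $r_K = \{K(0) - 0.5\int K^2\}/\int (K - 0.5K * K)^2$ of Fan, Zhang and Zhang [10] and approximates the integrals and the convolution $K * K$ by Riemann sums on a fine grid; the function names and the grid settings are hypothetical.

```python
import numpy as np


def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+, supported on [-1, 1]."""
    return 0.75 * np.maximum(1.0 - t**2, 0.0)


def glrt_constants(kernel, support=1.0, step=1e-3):
    """Return (r_K, c_K) with c_K = K(0) - 0.5 * int K^2 for a kernel on [-support, support]."""
    m = int(round(3.0 * support / step))
    t = np.arange(-m, m + 1) * step              # symmetric grid containing 0
    K = kernel(t)
    KK = np.convolve(K, K, mode="same") * step   # K * K evaluated on the same grid
    c_K = kernel(np.array([0.0]))[0] - 0.5 * np.sum(K**2) * step
    r_K = c_K / (np.sum((K - 0.5 * KK) ** 2) * step)
    return float(r_K), float(c_K)


def glrt_df(r_K, c_K, p, omega_len, h):
    """Degrees of freedom df_n = r_K * p * |Omega| * c_K / h."""
    return r_K * p * omega_len * c_K / h
```

For example, `glrt_df(*glrt_constants(epanechnikov), p=2, omega_len=1.0, h=0.125)` gives the asymptotic degrees of freedom for testing two coefficient functions with $U$ supported on an interval of length 1.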
4. Simulation study and application. In this section, we conduct extensive
Monte Carlo simulations in order to examine the finite sample performance
of the proposed procedures.

The performance of the estimator $\hat\alpha(\cdot)$ will be assessed by using the square
root of average square errors (RASE)

$$\mathrm{RASE} = \left\{n_{\mathrm{grid}}^{-1}\sum_{k=1}^{n_{\mathrm{grid}}}\|\hat\alpha(u_k) - \alpha(u_k)\|^2\right\}^{1/2},$$

where $\{u_k, k = 1, \ldots, n_{\mathrm{grid}}\}$ are the grid points at which the functions $\{\hat\alpha_j(\cdot)\}$
are evaluated. In our simulation, the Epanechnikov kernel $K(u) = 0.75(1 - u^2)_+$
and $n_{\mathrm{grid}} = 200$ are used.
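The RASE criterion is straightforward to compute once the estimated and true coefficient functions can be evaluated on the grid. A minimal sketch, assuming NumPy and that each function returns a length-$p$ array (the function name `rase` is hypothetical):

```python
import numpy as np


def rase(alpha_hat, alpha_true, grid):
    """Square root of the average squared error of alpha_hat over the grid points."""
    diff = np.array([np.asarray(alpha_hat(u)) - np.asarray(alpha_true(u)) for u in grid])
    diff = diff.reshape(len(grid), -1)           # one row per grid point
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))
```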

In an earlier version of this paper (Li and Liang [17]), we assessed the 
performance of the proposed estimation procedure for $\beta$ without the task
of variable selection and concluded that the proposed estimation procedure
performs well. We have since further tested the accuracy of the proposed
standard error formula and found that it works fairly well. In this section, 
we focus on the performance of the proposed variable selection procedures. 
The prediction error is defined as the average error in the prediction of the 
dependent variable given the independent variables for future cases that are 
not used in the construction of a prediction equation. Let $\{U^*, X^*, Z^*, Y^*\}$
be a new observation from the GVCPLM model (1.1). The prediction error
for model (1.1) is then given by

$$\mathrm{PE}(\hat\alpha, \hat\beta) = E\{Y^* - \hat\mu(U^*, X^*, Z^*)\}^2,$$

where the expectation is a conditional expectation given the data used in
constructing the prediction procedure. The prediction error can be
decomposed as

$$\mathrm{PE}(\hat\alpha, \hat\beta) = E\{Y^* - \mu(U^*, X^*, Z^*)\}^2 + E\{\hat\mu(U^*, X^*, Z^*) - \mu(U^*, X^*, Z^*)\}^2.$$

The first component is the inherent prediction error due to noise. The
second component is due to the lack of fit with an underlying model. This
component is called the model error. Note that $\hat\alpha$, $\hat\beta$ provide a consistent
estimate and that $\mu(U^*, X^*, Z^*) = g^{-1}\{X^{*T}\alpha(U^*) + Z^{*T}\beta\}$. By means of a
Taylor expansion, we have the approximation

$$\begin{aligned}
\hat\mu(U^*, X^*, Z^*) \approx{} & \mu(U^*, X^*, Z^*) + \dot g^{-1}\{X^{*T}\alpha(U^*) + Z^{*T}\beta\}X^{*T}\{\hat\alpha(U^*) - \alpha(U^*)\} \\
& + \dot g^{-1}\{X^{*T}\alpha(U^*) + Z^{*T}\beta\}Z^{*T}(\hat\beta - \beta),
\end{aligned}$$

where $\dot g^{-1}(t) = dg^{-1}(t)/dt$. Therefore, the model error can be approximated
by

$$\begin{aligned}
E\Big([\dot g^{-1}\{X^{*T}\alpha(U^*) + Z^{*T}\beta\}]^2 \big([X^{*T}\{\hat\alpha(U^*) - \alpha(U^*)\}]^2 & + [Z^{*T}(\hat\beta - \beta)]^2 \\
& + 2[X^{*T}\{\hat\alpha(U^*) - \alpha(U^*)\}] \times [Z^{*T}(\hat\beta - \beta)]\big)\Big).
\end{aligned}$$
The first component is the inherent model error due to the lack of fit of the
nonparametric component $\alpha_0(\cdot)$, the second is due to the lack of fit of the
parametric component and the third is the cross product between the first
two components. Thus, we define generalized mean square error (GMSE) for
the parametric component as

$$\mathrm{GMSE}(\hat\beta) = E[Z^{*T}(\hat\beta - \beta)]^2 = (\hat\beta - \beta)^T E(Z^* Z^{*T})(\hat\beta - \beta) \tag{4.2}$$

and use the GMSE to assess the performance of the newly proposed variable 
selection procedures for the parametric component. 
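The GMSE in (4.2) is a quadratic form in the estimation error. The sketch below assumes that $Z^*$ has zero mean, so that $E(Z^* Z^{*T})$ equals its covariance matrix, and uses a covariance of the form $\rho^{|i-j|}$ as in the simulation examples; the function names are hypothetical.

```python
import numpy as np


def ar1_covariance(d, rho=0.5):
    """Covariance matrix with entries rho^{|i-j|}."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def gmse(beta_hat, beta, second_moment):
    """GMSE = (beta_hat - beta)^T E(Z* Z*^T) (beta_hat - beta)."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(diff @ second_moment @ diff)
```

The relative GMSE of a selected model is then the ratio of its GMSE to that of the full model fit.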

Example 4.1. In this example, we consider a semivarying Poisson
regression model. Given $(U, X, Z)$, $Y$ has a Poisson distribution with mean
function $\mu(U, X, Z)$, where

$$\mu(U, X, Z) = \exp\{X^T \alpha(U) + Z^T \beta\}.$$

In our simulation, we take $U \sim U(0, 1)$, $X = (X_1, X_2)^T$ with $X_1 = 1$ and
$X_2 \sim N(0, 1)$, $\alpha_1(u) = 5.5 + 0.1\exp(2u - 1)$ and $\alpha_2(u) = 0.8u(1 - u)$.
Furthermore, $\beta = [0.3, 0.15, 0, 0, 0.2, 0, 0, 0, 0, 0]^T$ and $Z$ has a 10-dimensional
normal distribution with zero mean and covariance matrix $(\sigma_{ij})_{10 \times 10}$ with
$\sigma_{ij} = 0.5^{|i-j|}$. In our simulation, we take the sample size $n = 200$ and
bandwidth $h = 0.125$.
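The data-generating mechanism of this example can be sketched as follows, assuming NumPy's `Generator` interface; the function name `simulate_example_4_1` and the seed handling are hypothetical and are not taken from the original simulation code.

```python
import numpy as np


def simulate_example_4_1(n=200, seed=0):
    """Draw one sample (U, X, Z, Y) from the semivarying Poisson model."""
    rng = np.random.default_rng(seed)
    d = 10
    beta = np.array([0.3, 0.15, 0, 0, 0.2, 0, 0, 0, 0, 0], dtype=float)
    idx = np.arange(d)
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])     # sigma_ij = 0.5^{|i-j|}

    U = rng.uniform(0.0, 1.0, size=n)
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)

    # Coefficient functions alpha_1(u) and alpha_2(u) evaluated at each U_i.
    alpha = np.column_stack([5.5 + 0.1 * np.exp(2 * U - 1), 0.8 * U * (1 - U)])
    eta = np.sum(X * alpha, axis=1) + Z @ beta           # X^T alpha(U) + Z^T beta
    Y = rng.poisson(np.exp(eta))
    return U, X, Z, Y
```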



Performance of procedures for $\hat\beta$. Here, we compare the variable selection
procedures for $Z$. One may generalize the traditional subset selection
criteria for linear regression models to the GVCPLM by taking the penalty
function in (2.1) to be the $L_0$ penalty. Specifically, $p_{\lambda_j}(|\beta|) = 0.5\lambda_j^2 I(|\beta| \neq 0)$.
We will refer to the AIC, BIC and RIC as the penalized likelihood with
the $L_0$ penalty with $\lambda_j = \sqrt{2/n}$, $\sqrt{\log(n)/n}$ and $\sqrt{2\log(d)/n}$, respectively.
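These choices can be checked with a few lines of code: with the $L_0$ penalty, each nonzero coefficient costs $n \times 0.5\lambda_j^2$ on the likelihood scale, which equals 1 for AIC, $0.5\log n$ for BIC and $\log d$ for RIC. The helper name `l0_lambdas` below is hypothetical.

```python
import math


def l0_lambdas(n, d):
    """Regularization parameters reproducing AIC, BIC and RIC under the L0 penalty."""
    return {
        "AIC": math.sqrt(2.0 / n),
        "BIC": math.sqrt(math.log(n) / n),
        "RIC": math.sqrt(2.0 * math.log(d) / n),
    }


def cost_per_coefficient(lam, n):
    """Penalty n * 0.5 * lam^2 paid for each nonzero coefficient."""
    return n * 0.5 * lam**2
```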



Table 1
Comparisons of variable selection

            Poisson with h = 0.125                 Logistic with h = 0.3
            RGMSE                                  RGMSE
Penalty     Median(MAD)        C         I         Median(MAD)        C         I
SCAD        0.3253 (0.2429)    6.8350              0.5482 (0.3279)    6.7175
L1          0.8324 (0.1651)    4.9650              0.7247 (0.2024)    5.3625
AIC         0.7118 (0.2228)    5.6825              0.8353 (0.1467)    5.7225
BIC         0.3793 (0.2878)    6.7400              0.5852 (0.3146)    6.9100
RIC         0.4297 (0.2898)    6.6475              0.6665 (0.2719)    6.7100
Oracle      0.2750 (0.1983)    7                   0.5395 (0.3300)    7

Table 2
Computing times

            Penalty    d = 8                d = 9                 d = 10
Poisson     SCAD       0.0485 (0.0145)      0.0584 (0.0155)       0.0471 (0.0132)
            L1         0.0613 (0.0145)      0.0720 (0.0217)       0.0694 (0.0188)
            BIC        0.8255 (0.1340)      2.1558 (0.6836)       4.6433 (1.3448)
Logistic    SCAD       2.7709 (0.1606)      2.8166 (0.1595)       2.8337 (0.1449)
            L1         8.1546 (0.8931)      7.9843 (0.9196)       8.1952 (0.9491)
            BIC        61.5723 (1.4404)     131.8402 (2.6790)     280.0237 (6.6325)




Since the $L_0$ penalty is discontinuous, we search over all possible subsets
to find the corresponding solutions. Thus, these procedures will be referred
to as best subset variable selection. We compare the performance of the
penalized likelihood with the $L_1$ penalty and the SCAD penalty with the
best subset variable selection in terms of GMSE and model complexity.
Let us define relative GMSE to be the ratio of GMSE of a selected final 
model to that of the full model. The median of relative GMSE over the 
400 simulations, along with the median of absolute deviation divided by a 
factor of 0.6745, is displayed in the column of Table 1 labeled "RGMSE." 
The average number of zero coefficients is also given in Table 1, where the
column labeled "C" gives the average number of the seven true zero
coefficients correctly set to zero and the column labeled "I" gives the
average number of the three true nonzeros incorrectly set to zero. In Table 1,
"Oracle" stands for the oracle estimate computed by using the true model
$g\{E(y \mid u, x, z)\} = x^T \alpha(u) + \beta_1 z_1 + \beta_2 z_2 + \beta_5 z_5$. According to Table 1, the
performance of the SCAD is close to that of the oracle procedure in terms
of model error and model complexity, and it performs better than penalized
likelihood with the $L_1$ penalty and best subset variable selection using the AIC and
RIC. The performance of the SCAD is similar to that of best subset
variable selection with BIC. However, best subset variable selection demands
much more computation. To illustrate this, we compare the computing time 
for each procedure. Table 2 includes the average and standard deviation of 
computing times over 50 Monte Carlo simulations for d = 8, 9 and 10. For 
$d = 8$ and 9, $\beta_1 = 0.3$, $\beta_2 = 0.15$, $\beta_5 = 0.2$ and the other components of $\beta$ are 0; $Z$
has a multivariate normal distribution with zero mean and the same covariance
structure as that for $d = 10$; $U$, $X$ and $\alpha(U)$ are the same as those for
$d = 10$. We include only the computing time for BIC in Table 2. Computing
time for AIC and RIC is almost identical to that for BIC. It is clear from
Table 2 that BIC needs much more computing time than the SCAD and $L_1$
and that it exponentially increases as $d$ increases.
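The summary statistics reported in the "RGMSE" column of Table 1 (the median and the median absolute deviation divided by 0.6745) can be computed as in the following sketch, assuming NumPy; the function name `median_and_mad` is hypothetical.

```python
import numpy as np


def median_and_mad(values):
    """Median and median absolute deviation scaled by 1/0.6745."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) / 0.6745
    return float(med), float(mad)
```

Dividing by 0.6745 makes the MAD comparable to a standard deviation when the values are approximately normal.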
Performance of procedures for $\alpha(u)$. It is of interest to assess the impact
of estimation of $\beta$ on the estimation of $\alpha(\cdot)$. To this end, we consider two
scenarios: one is to estimate $\alpha(\cdot)$ using the proposed backfitting algorithm,
and the other is to estimate $\alpha(\cdot)$ with the true value of $\beta$. The plot of one
RASE versus the other is depicted in Figure 1(a), from which it can be seen
that the estimate $\hat\alpha$ using the proposed backfitting algorithm performs as
well as if we knew the true value of $\beta$. This is consistent with our theoretic
analysis because $\hat\beta$ is root-$n$ consistent and this convergence rate is faster
than the convergence rate of a nonparametric estimate.

We now assess the performance of the test procedures proposed in Section 
3. Here, we consider the null hypothesis 

$$H_0: \alpha_2(u) = 0 \quad \text{versus} \quad H_1: \alpha_2(u) \neq 0.$$

We first examine whether the finite sample null distribution of the pro- 
posed GLRT is close to a chi-square distribution. To this end, we conduct 
1000 bootstraps to obtain the null distribution of the proposed GLRT. The 
kernel density estimate of the null distribution is depicted in Figure 1(c), 
in which the solid line corresponds to the estimated density function and 
the dotted line to a density of the $\chi^2$-distribution with degrees of freedom
approximately equaling the sample mean of the bootstrap sample. From 
Figure 1(c), the finite sample null distribution is quite close to a chi-square 
distribution. 

We next examine the Type I error rate and power of the proposed GLRT. 
The power functions are evaluated under a sequence of alternative models
indexed by $\delta$:

$H_1: \alpha_2(u) = \delta \times 0.8u(1 - u)$.

Figure 1(e) depicts four power functions based on 400 simulations at four
different significance levels: 0.25, 0.1, 0.05 and 0.01. When $\delta = 0$, the
alternative collapses into the null hypothesis. The powers at $\delta = 0$ for
these four significance levels are 0.2250, 0.0875, 0.05 and 0.0125. This shows
that the bootstrap method gives correct Type I error rates. The power functions
increase rapidly as $\delta$ increases. This shows that the proposed GLRT works
well.
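One way to read the reported rejection rates at $\delta = 0$ is to compare them with the Monte Carlo error expected from 400 simulations. The sketch below uses the binomial standard error $\sqrt{p_0(1 - p_0)/400}$; this check is an addition for illustration, not part of the original analysis, and the function name is hypothetical.

```python
import math

n_sim = 400
nominal = [0.25, 0.10, 0.05, 0.01]
observed = [0.2250, 0.0875, 0.05, 0.0125]  # powers at delta = 0 reported above

def mc_z_score(p_hat, p0, n):
    """Standardized gap between an empirical rejection rate and its nominal level."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

z_scores = [mc_z_score(p, a, n_sim) for p, a in zip(observed, nominal)]
for a, p, z in zip(nominal, observed, z_scores):
    print(f"level {a:.2f}: observed {p:.4f}, z = {z:+.2f}")
```

All four observed rates lie within two Monte Carlo standard errors of their nominal levels, which is consistent with the statement that the Type I error rates are correct.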

Example 4.2. In this example, we consider a semivarying coefficient 
logistic regression model. Given (U, X, Z), Y has a Bernoulli distribution 
with success probability p(U, X, Z), where 

$p(U, X, Z) = \exp\{X^T\alpha(U) + Z^T\beta\}/[1 + \exp\{X^T\alpha(U) + Z^T\beta\}]$.

In our simulation, U, X, Z are the same as those in Example 4.1, but the
coefficient functions are taken to be

$\alpha_1(u) = \exp(2u - 1)$, $\alpha_2(u) = 2\sin^2(2\pi u)$

[Figure 1. Panels: (a) Plot of RASEs (Poisson); (b) Plot of RASEs (Logistic); (e) Power function (Poisson); (f) Power function (Logistic).]

Fig. 1. Plots for Examples 4.1 (left panel) and 4.2 (right panel). In (a) and (b), RASE$_0$
stands for the RASE of $\hat\alpha(u)$ with the true $\beta$ and RASE$_1$ stands for the RASE of
$\hat\alpha(u)$ using the backfitting algorithm. In (c) and (d), the solid lines correspond to
the estimated null density and the dotted lines to the density of the $\chi^2$-distribution
with $df_n$ being the mean of the bootstrap sample. (e) and (f) are power functions of the
GLRT at levels 0.25, 0.10, 0.05 and 0.01.

and $\beta = [3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0]^T$. We conduct 400 simulations and in each
simulation, the sample size n is set at 1000 and the bandwidth at h = 0.3.
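The success probability of this example can be written down directly from the coefficient functions and $\beta$ given above. The sketch below evaluates it for a single observation; the function names are hypothetical, and the distributions of U, X and Z (defined in Example 4.1) are not reproduced here.

```python
import math

def alpha1(u):
    return math.exp(2 * u - 1)

def alpha2(u):
    return 2 * math.sin(2 * math.pi * u) ** 2

BETA = [3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0]

def success_prob(u, x, z, beta=BETA):
    """p(U, X, Z) for x = (x1, x2) and z of length 10."""
    eta = x[0] * alpha1(u) + x[1] * alpha2(u) + sum(b * zj for b, zj in zip(beta, z))
    # exp(eta) / (1 + exp(eta)) written in the equivalent logistic form
    return 1.0 / (1.0 + math.exp(-eta))
```

A response Y is then drawn as a Bernoulli variable with this probability, for example by comparing it with a uniform random number.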

Performance of procedures for $\beta$. We investigate the performance of the
proposed variable selection procedures. Simulation results are summarized
in the rightmost column of Table 1, from which we can see that the SCAD
performs the best and that its performance is very close to that of the oracle
procedure. We employ the same strategy as in Example 4.1 to compare computing
times of each variable selection procedure. The mean and standard
deviation of computing time are given in the bottom row of Table 2, from
which it can be seen that the computing time for the best subset variable
selection procedure increases exponentially as the dimension of $\beta$ increases,
while this is not the case for penalized likelihood with the SCAD penalty
and the $L_1$ penalty.

Performance of procedures for $\alpha(u)$. We employ RASE to assess the
performance of $\hat\alpha(u)$. Figure 1(b) plots the RASE of $\hat\alpha(\cdot)$ using the proposed
backfitting algorithm against that of $\hat\alpha(\cdot)$ using the true value of $\beta$. The
performance of the backfitting algorithm is quite close to that using the
true value of $\beta$.

We next examine the performance of the proposed GLRT for logistic 
regression. Here, we consider the null hypothesis 

$H_0: \alpha_2(u) = 0$ versus $H_1: \alpha_2(u) \neq 0$.

The estimated density of the null distribution is depicted in Figure 1(d), from
which we can see that it is close to a $\chi^2$-distribution. The power functions
are evaluated under a sequence of alternative models indexed by $\delta$:

$H_1: \alpha_2(u) = \delta \times 2\sin^2(2\pi u)$.

The power functions are depicted in Figure 1(f), from which it can be seen
that the power functions increase rapidly as $\delta$ increases.

Example 4.3. We now apply the methodology proposed in this paper to 
the analysis of a data set compiled by the General Hospital Burn Center at 
the University of Southern California. The binary response variable Y is 1 for
those victims who survived their burns and 0 otherwise. The variable U in this
application represents age, and fourteen other covariates were considered.
We first employ a generalized varying-coefficient model (Cai, Fan and Li
[4]) to fit the data by allowing the coefficients of all fourteen covariates to
be age-dependent. Based on the resulting estimates and standard errors, we
consider a generalized varying-coefficient partially linear model for binary
response,

(4.3)  $\operatorname{logit}\{E(Y|U = u, X = x, Z = z)\} = x^T\alpha(u) + z^T\beta$,
where $X_1 = 1$ and $\alpha_1(u)$ is the intercept function. Other covariates are as
follows. $X_2$ stands for log(burn area + 1), $X_3$ for prior respiratory disease
(coded by 0 for none and 1 for yes), $Z_1$ for gender (coded by 0 for male and
1 for female), $Z_2$ for days injured prior to admission date (coded by 0 for
one or more days and 1 otherwise), $Z_3$ for airway edema (coded by 0 for not
present and 1 for present), $Z_4$ for sootiness (coded by 1 for yes and 0 for no),
$Z_5$ for partial pressure of oxygen, $Z_6$ for partial pressure of carbon dioxide,
$Z_7$ for pH (acidity) reading, $Z_8$ for percentage of CbHg, $Z_9$ for oxygen supply
(coded by 0 for normal and 1 for abnormal), $Z_{10}$ for carbon dioxide status
(coded by 0 for normal and 1 for abnormal), $Z_{11}$ for acid status (coded by
0 for normal and 1 for abnormal) and $Z_{12}$ for hemo status (coded by 0 for
normal and 1 for abnormal).

In this demonstration, we are interested in studying how the included 
covariates affect survival probabilities for victims in different age groups. 
We first employ a multifold cross-validation method to select a bandwidth. 
We partition the data into K groups. For each k = 1, ..., K, we fit the
data to model (4.3), excluding data in the kth group, denoted by $\mathcal{D}_k$, and
compute the deviance on the excluded group. This leads to a cross-validation
criterion,

$CV(h) = \sum_{k=1}^{K} \sum_{i \in \mathcal{D}_k} D\{y_i, \hat\mu_{-k}(u_i, x_i, z_i)\}$,

where $D(y, \hat\mu)$ is the deviance of the Bernoulli distribution, $\hat\mu_{-k}(u_i, x_i, z_i)$ is
the fitted value of $Y_i$, that is, $\operatorname{logit}^{-1}\{x_i^T\hat\alpha_{-k}(u_i) + z_i^T\hat\beta_{-k}\}$, and $\hat\alpha_{-k}(\cdot)$ and
$\hat\beta_{-k}$ are estimated without including data from $\mathcal{D}_k$. In our implementation,
we set K = 10. Figure 2(a) depicts the plot of cross-validation scores over
the bandwidth. The selected bandwidth is 48.4437. With the selected bandwidth,
the resulting estimate of $\alpha(u)$ is depicted in Figures 2(b), (c) and (d).
From the plot of $\hat\alpha_3(u)$ in Figure 2(d), we see that the 95% pointwise
confidence interval almost covers zero. Thus, it is of interest to test whether or
not $X_3$ is significant. To this end, we employ the semiparametric generalized
likelihood ratio test procedure for the following hypothesis:

$H_0: \alpha_3(u) = 0$ versus $H_1: \alpha_3(u) \neq 0$.

The resulting generalized likelihood ratio test statistic for this problem is
15.7019, with a P value of 0.015, based on 1,000 bootstrap samples. Thus, the
covariate prior respiratory disease is significant at level 0.05. The result also
implies that the generalized likelihood ratio test is quite powerful as the
resulting estimate of $\alpha_3(u)$ only slightly deviates from 0.
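The multifold cross-validation criterion used above to select the bandwidth can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the model fit is passed in as a hypothetical callable `fit(train, h)` that returns a prediction function, and the interleaved assignment of observations to the K groups is an assumption, since the partition rule is not specified.

```python
import math

def bernoulli_deviance(y, mu, eps=1e-12):
    """Deviance contribution of one binary observation y with fitted mean mu."""
    mu = min(max(mu, eps), 1 - eps)  # guard against log(0)
    return -2.0 * (y * math.log(mu) + (1 - y) * math.log(1 - mu))

def cv_score(data, h, fit, K=10):
    """Sum of held-out deviances over K folds for bandwidth h.

    data is a list of (y, u, x, z) tuples; fit(train, h) must return a
    function predict(u, x, z) giving the fitted success probability.
    """
    folds = [data[k::K] for k in range(K)]  # assumed partition rule
    total = 0.0
    for k in range(K):
        train = [row for j, fold in enumerate(folds) if j != k for row in fold]
        predict = fit(train, h)
        total += sum(bernoulli_deviance(y, predict(u, x, z))
                     for (y, u, x, z) in folds[k])
    return total
```

The bandwidth would then be chosen by evaluating `cv_score` over a grid of candidate values of h and taking the minimizer, as in Figure 2(a).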

We next select significant z-variables. We apply the SCAD procedure
proposed in Section 2 to the data. The tuning parameter $\lambda$ is chosen by
minimizing the GCV scores. The selected $\lambda$ equals 0.4226. With this selected
tuning parameter, the SCAD procedure yields a model with only

[Figure 2. Panels: (a) Cross-validation Scores; (b) Intercept Function; axis label: Age.]

Fig. 2. Plots for Example 4.3.

three z-variables: $Z_3$, $Z_5$ and $Z_7$. Their estimates and standard errors are
-1.9388 (0.4603), -0.0035 (0.0054) and -0.0007 (0.0006), respectively. As a
result, we recommend the following model:

$\hat Y = \hat\alpha_1(U) + \hat\alpha_2(U)X_2 + \hat\alpha_3(U)X_3 - 1.9388Z_3 - 0.0035Z_5 - 0.0007Z_7$,

where the $\hat\alpha(U)$'s and their 95% confidence intervals are plotted in Figure 2.
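For reference, the SCAD penalty and its derivative can be coded in a few lines. The sketch below uses the standard form of Fan and Li with $a = 3.7$; that the procedure of Section 2 uses exactly this form and this value of $a$ is an assumption here, and the function names are hypothetical.

```python
def scad_derivative(theta, lam, a=3.7):
    """First derivative p'_lambda(theta) of the SCAD penalty for theta >= 0."""
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1)

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|), obtained by integrating the derivative."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2
```

With the selected $\lambda = 0.4226$, coefficients larger than $a\lambda$ in absolute value receive zero derivative, so they are not shrunk, while coefficients below $\lambda$ are penalized at the constant rate $\lambda$.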

5. Proofs. For simplicity of notation, in this section, we absorb $\sigma^2$ into
$V(\cdot)$, so that the variance of Y given (U, X, Z) is $V\{\mu(U, X, Z)\}$. Define
$q_\ell(x, y) = (\partial^\ell/\partial x^\ell)Q\{g^{-1}(x), y\}$ for $\ell = 1, 2, 3$. Then

(5.1)  $q_1(x, y) = \{y - g^{-1}(x)\}\rho_1(x)$,
       $q_2(x, y) = \{y - g^{-1}(x)\}\rho_1'(x) - \rho_2(x)$,

where $\rho_\ell(t) = \{dg^{-1}(t)/dt\}^\ell / V\{g^{-1}(t)\}$ was introduced in Section 2. In the
following regularity conditions, u is a generic argument for Theorem 2 and the
condition must hold uniformly in u for Theorems 1-3.
Regularity conditions:

(i) The function $q_2(x, y) < 0$ for $x \in \mathbb{R}$ and y in the range of the response
variable.

(ii) The random variable U has bounded support $\Omega$. The elements of
the function $\alpha_0''(\cdot)$ are continuous in $u \in \Omega$.

(iii) The density function f(u) of U has a continuous second derivative.

(iv) The functions $V''(\cdot)$ and $g'''(\cdot)$ are continuous.

(v) With $R = \alpha_0^T(U)X + Z^T\beta_0$, $E\{q_1^2(R, Y)|U = u\}$, $E\{q_1^2(R, Y)Z|U = u\}$
and $E\{q_1^2(R, Y)ZZ^T|U = u\}$ are twice differentiable in $u \in \Omega$. Moreover,
$E\{q_2^2(R, Y)\} < \infty$ and $E\{q_1^{2+\delta}(R, Y)\} < \infty$ for some $\delta > 2$.

(vi) The kernel K is a symmetric density function with bounded support.

(vii) The random vector Z is assumed to have bounded support.
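As an illustration of condition (i), the quantities in (5.1) can be worked out for the two models used in the simulations, assuming canonical links (logit for the logistic model of Example 4.2 and log for a Poisson model as in Example 4.1) and the usual variance functions. This is a worked example added for orientation, not part of the original argument.

```latex
% Logistic model: g(\mu) = \log\{\mu/(1-\mu)\}, V(\mu) = \mu(1-\mu)
\begin{align*}
\mu(x) &= g^{-1}(x) = \frac{e^{x}}{1+e^{x}}, &
\frac{d g^{-1}(x)}{dx} &= \mu(x)\{1-\mu(x)\}, \\
\rho_1(x) &= \frac{\mu(x)\{1-\mu(x)\}}{V\{\mu(x)\}} = 1, &
\rho_2(x) &= \frac{[\mu(x)\{1-\mu(x)\}]^{2}}{V\{\mu(x)\}} = \mu(x)\{1-\mu(x)\}, \\
q_1(x,y) &= y - \mu(x), &
q_2(x,y) &= -\mu(x)\{1-\mu(x)\} < 0.
\end{align*}
% Poisson model: g(\mu) = \log\mu, V(\mu) = \mu
\begin{align*}
g^{-1}(x) &= e^{x}, & \rho_1(x) &= 1, & \rho_2(x) &= e^{x}, \\
q_1(x,y) &= y - e^{x}, & q_2(x,y) &= -e^{x} < 0.
\end{align*}
```

In both cases $\rho_1$ is constant, so the term involving $\rho_1'(x)$ in (5.1) vanishes and $q_2$ is negative for every x and y.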

Condition (i) is imposed so that the local likelihood is concave in the pa- 
rameters, which ensures the uniqueness of the solution. Conditions (vi) and 
(vii) are imposed just to simplify the proofs; they can be weakened signif- 
icantly at the expense of lengthy proofs. In our proofs, we will repeatedly 
use the following lemma, a direct result of Mack and Silverman [18]. 

Lemma 1. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. random vectors, where the
$Y_i$'s are scalar random variables. Assume further that $E|Y|^r < \infty$ and that
$\sup_x \int |y|^r f(x, y)\,dy < \infty$, where f denotes the joint density of (X, Y). Let K
be a bounded positive function with bounded support, satisfying a Lipschitz
condition. Then

$\sup_{x \in D} \left| n^{-1} \sum_{i=1}^{n} \{K_h(X_i - x)Y_i - E[K_h(X_i - x)Y_i]\} \right| = O_P\left[ \left\{ \frac{nh}{\log(1/h)} \right\}^{-1/2} \right]$,

provided that $n^{2\varepsilon - 1}h \to \infty$ for some $\varepsilon < 1 - r^{-1}$.

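To establish asymptotic properties of $\hat\beta$, we first study the asymptotic
behaviors of $\tilde a$, $\tilde b$ and $\tilde\beta$. Let us introduce some notation. Let $\bar\alpha_i = \bar\alpha_i(u) =
X_i^T\alpha_0(u) + Z_i^T\beta_0 + (U_i - u)X_i^T\alpha_0'(u)$. Write $X_i^* = (X_i^T, (U_i - u)X_i^T/h, Z_i^T)^T$,

$A(X, Z) = \begin{pmatrix} XX^T & 0 & XZ^T \\ 0 & \mu_2 XX^T & 0 \\ ZX^T & 0 & ZZ^T \end{pmatrix}$

and

$B(X, Z) = \begin{pmatrix} \nu_0 XX^T & 0 & \nu_0 XZ^T \\ 0 & \nu_2 XX^T & 0 \\ \nu_0 ZX^T & 0 & \nu_0 ZZ^T \end{pmatrix}$,

with $\mu_j = \int t^j K(t)\,dt$ and $\nu_j = \int t^j K^2(t)\,dt$.

Denote the local likelihood estimate in (2.2) by $\tilde a$, $\tilde b$ and $\tilde\beta$ and let
$\hat\beta^* = \sqrt{nh}\{(\tilde a - \alpha_0(u))^T, h(\tilde b - \alpha_0'(u))^T, (\tilde\beta - \beta_0)^T\}^T$.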
We then have the following asymptotic representation of $\hat\beta^*$.

Lemma 2. Under the regularity conditions given in Section 5, if $h \to 0$
and $nh \to \infty$ as $n \to \infty$, then $\hat\beta^* = A^{-1}W_n + O_P\{h^2 + c_n\log^{1/2}(1/h)\}$ holds
uniformly in $u \in \Omega$, the support of U, where

$W_n = \sqrt{h/n} \sum_{i=1}^{n} q_1(\bar\alpha_i, Y_i)X_i^*K_h(U_i - u)$

and

$A = f(u)E[\rho_2\{\alpha_0^T(U)X + Z^T\beta_0\}A(X, Z)|U = u]$.

By some direct calculation, we then have the following mean and variance
of $W_n$:

$EW_n = \sqrt{nh}\,\frac{\mu_2}{2}h^2 f(u)E[\rho_2\{\alpha_0^T(U)X + Z^T\beta_0\}(X^T, 0, Z^T)^T X^T|U = u]\,\alpha_0''(u) + o(c_n^{-1}h^2)$

and

$\operatorname{var}(W_n) = f(u)E[\rho_2\{\alpha_0^T(U)X + Z^T\beta_0\}B(X, Z)|U = u] + o(1)$.

Since $W_n$ is a sum of independent and identically distributed random vectors,
the asymptotic normality of $\tilde a$, $\tilde b$ and $\tilde\beta$ can be established by using the
central limit theorem and the Slutsky theorem. The next two theorems show
that the estimate $\tilde\beta$ can be improved by maximizing the penalized likelihood
(2.3).

Proof of Lemma 2. Throughout this proof, terms of the form $G(u) =
O_P(a_n)$ always stand for $\sup_{u \in \Omega}|G(u)| = O_P(a_n)$. Let $c_n = (nh)^{-1/2}$. If
$(\tilde a^T, \tilde b^T, \tilde\beta^T)^T$ maximizes (2.2), then $\hat\beta^*$ maximizes

$\ell_n(\beta^*) = h\sum_{i=1}^{n} [Q\{g^{-1}(c_n\beta^{*T}X_i^* + \bar\alpha_i), Y_i\} - Q\{g^{-1}(\bar\alpha_i), Y_i\}]K_h(U_i - u)$

with respect to $\beta^*$. The concavity of the function $\ell_n(\beta^*)$ is ensured by
condition (i). By a Taylor expansion of the function $Q\{g^{-1}(\cdot), Y_i\}$, we obtain
that

(5.2)  $\ell_n(\beta^*) = W_n^T\beta^* + \frac{1}{2}\beta^{*T}A_n\beta^*\{1 + o_P(1)\}$,

where $A_n = hc_n^2\sum_{i=1}^{n} q_2(\bar\alpha_i, Y_i)X_i^*X_i^{*T}K_h(U_i - u)$. Furthermore, it can be
shown that

(5.3)  $A_n = -A + o_P(1)$.

Therefore, by (5.2),

(5.4)  $\ell_n(\beta^*) = W_n^T\beta^* - \frac{1}{2}\beta^{*T}A\beta^* + o_P(1)$.

Note that each element in $A_n$ is a sum of i.i.d. random variables of kernel
form and hence, by Lemma 1, converges uniformly to its corresponding
element in $-A$. Consequently, expression (5.4) holds uniformly in $u \in \Omega$. By
the Convexity Lemma (Pollard [19]), it also holds uniformly in $\beta^* \in C$ and
$u \in \Omega$ for any compact set C. Lemma A.1 of Carroll et al. [5] then yields

(5.5)  $\sup_{u \in \Omega}|\hat\beta^* - A^{-1}W_n| \to_P 0$.

Furthermore, from the definition of $\hat\beta^*$, we have that

$\left.\frac{\partial\ell_n(\beta^*)}{\partial\beta^*}\right|_{\beta^* = \hat\beta^*} = c_nh\sum_{i=1}^{n} q_1(\bar\alpha_i + c_n\hat\beta^{*T}X_i^*, Y_i)X_i^*K_h(U_i - u) = 0$.

By using (5.5) and a Taylor expansion, we have

(5.6)  $W_n + A_n\hat\beta^* + \frac{c_n^3h}{2}\sum_{i=1}^{n} q_3(\bar\alpha_i + \zeta_i, Y_i)X_i^*\{\hat\beta^{*T}X_i^*\}^2K_h(U_i - u) = 0$,

where $\zeta_i$ is between 0 and $c_n\hat\beta^{*T}X_i^*$. The last term in the above expression
is of order $O_P(c_n\|\hat\beta^*\|^2)$. Since each element in $A_n$ is of kernel form, we
can deduce from Lemma 1 that $A_n = EA_n + O_P\{c_n\log^{1/2}(1/h)\} = -A +
O_P\{h^2 + c_n\log^{1/2}(1/h)\}$. Consequently, by (5.6), we obtain that

$W_n - A\hat\beta^*[1 + O_P\{h^2 + c_n\log^{1/2}(1/h)\}] + O_P(c_n\|\hat\beta^*\|^2) = 0$.

Hence,

$\hat\beta^* = A^{-1}W_n + O_P\{h^2 + c_n\log^{1/2}(1/h)\}$

holds uniformly for $u \in \Omega$. This completes the proof. □
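The step from the quadratic approximation (5.4) to the limit (5.5) can be made explicit with a short calculation. The sketch below ignores the $o_P(1)$ remainder and assumes that A is positive definite, which is not stated in the proof and is taken here as a working assumption.

```latex
\begin{align*}
\ell_n(\beta^*) &\approx W_n^T\beta^* - \tfrac{1}{2}\beta^{*T}A\beta^*, \\
\frac{\partial}{\partial\beta^*}\Bigl\{W_n^T\beta^* - \tfrac{1}{2}\beta^{*T}A\beta^*\Bigr\}
  &= W_n - A\beta^* = 0
  \quad\Longrightarrow\quad \beta^* = A^{-1}W_n, \\
\frac{\partial^2}{\partial\beta^*\,\partial\beta^{*T}}\Bigl\{W_n^T\beta^* - \tfrac{1}{2}\beta^{*T}A\beta^*\Bigr\}
  &= -A \quad (\text{negative definite}).
\end{align*}
```

The quadratic form therefore has the unique maximizer $A^{-1}W_n$, and the cited convexity arguments transfer this to the maximizer $\hat\beta^*$ of $\ell_n$ itself.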

Proof of Theorem 1. Let $\gamma_n = n^{-1/2} + a_n$. It suffices to show that
for any given $\zeta > 0$, there exists a large constant C such that

(5.7)  $P\{\sup_{\|v\| = C}\mathcal{L}_P(\beta_0 + \gamma_nv) < \mathcal{L}_P(\beta_0)\} \ge 1 - \zeta$.

Define

$D_{n,1} = \sum_{i=1}^{n} [Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T(\beta_0 + \gamma_nv)), Y_i\} - Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T\beta_0), Y_i\}]$

and $D_{n,2} = -n\sum_{j=1}^{s}\{p_{\lambda_n}(|\beta_{j0} + \gamma_nv_j|) - p_{\lambda_n}(|\beta_{j0}|)\}$, where s is the number
of components of $\beta_{10}$. Note that $p_{\lambda_n}(0) = 0$ and $p_{\lambda_n}(|\beta_j|) \ge 0$ for all $\beta_j$. Thus,

$\mathcal{L}_P(\beta_0 + \gamma_nv) - \mathcal{L}_P(\beta_0) \le D_{n,1} + D_{n,2}$.

We first deal with $D_{n,1}$. Let $\hat m_i = \hat\alpha^T(U_i)X_i + Z_i^T\beta_0$. Thus,

(5.8)  $D_{n,1} = \sum_{i=1}^{n} [Q\{g^{-1}(\hat m_i + \gamma_nv^TZ_i), Y_i\} - Q\{g^{-1}(\hat m_i), Y_i\}]$.

By means of a Taylor expansion, we obtain

(5.9)  $D_{n,1} = \sum_{i=1}^{n} q_1(\hat m_i, Y_i)\gamma_nv^TZ_i - \frac{n}{2}\gamma_n^2v^TB_nv$,

where $B_n = n^{-1}\sum_{i=1}^{n}\rho_2\{g^{-1}(\hat m_i + \zeta_{ni})\}Z_iZ_i^T$, with $\zeta_{ni}$ between 0 and $\gamma_nv^TZ_i$,
independent of $Y_i$. It can be shown that

(5.10)  $B_n = -E\rho_2\{\alpha_0^T(U)X + Z^T\beta_0\}ZZ^T + o_P(1) = -B + o_P(1)$.

Let $m_i = \alpha_0^T(U_i)X_i + Z_i^T\beta_0$. We have

$n^{-1/2}\sum_{i=1}^{n} q_1(\hat m_i, Y_i)Z_i = n^{-1/2}\sum_{i=1}^{n} q_1(m_i, Y_i)Z_i$
    $+\; n^{-1/2}\sum_{i=1}^{n} q_2(m_i, Y_i)[\{\hat\alpha(U_i) - \alpha_0(U_i)\}^TX_i]Z_i$
    $+\; O_P(n^{1/2}\|\hat\alpha - \alpha_0\|_\infty^2)$.

By Lemma 2, the second term in the above expression can be expressed as

$n^{-3/2}\sum_{i=1}^{n}\sum_{j=1}^{n} q_2(m_i, Y_i)f(U_i)^{-1}(W_j^TX_i)K_h(U_j - U_i)Z_i + O_P\{n^{1/2}c_n^2\log^{1/2}(1/h)\}$
    $\equiv T_{n1} + O_P\{n^{1/2}c_n^2\log^{1/2}(1/h)\}$,

where $W_j$ is the vector consisting of the first p elements of $q_1(m_j, Y_j)\Sigma^{-1}(u)$.
Define $\tau_j = \tau(X_j, Y_j, Z_j)$, consisting of the first p elements of

$q_1(m_j, Y_j)\Sigma^{-1}(u)(X_j^T, Z_j^T)^T$.

Using the definition of $\bar\alpha_j(U_i)$, we obtain $\bar\alpha_j(U_i) - m_j = O((U_j - U_i)^2)$ and
therefore

$T_{n1} = n^{-3/2}\sum_{i=1}^{n}\sum_{j=1}^{n} q_2(m_i, Y_i)f(U_i)^{-1}(\tau_j^TX_i)K_h(U_j - U_i)Z_i + O_P(n^{1/2}h^2)$
    $\equiv T_{n2} + O_P(n^{1/2}h^2)$.

It can be shown, by calculating the second moment, that

(5.11)  $T_{n2} - T_{n3} \to_P 0$,

where $T_{n3} = -n^{-1/2}\sum_{j=1}^{n}\gamma(U_j)$ with

$\gamma(u_j) = \sum_{k=1}^{p}\tau_{jk}E[\rho_2\{\alpha_0^T(u)X + Z^T\beta_0\}X_kZ|U = u_j]$.

Combining (5.8)-(5.11), we obtain that

(5.12)  $D_{n,1} = \gamma_nv^T\sum_{i=1}^{n}\Omega(U_i, Y_i, Z_i) - \frac{1}{2}n\gamma_n^2v^TBv + o_P(1)$,

where $\Omega(U_i, Y_i, Z_i) = q_1(m_i, Y_i)Z_i - \gamma(U_i)$. The orders of the first term and
the second term are $O_P(n^{1/2}\gamma_n)$ and $O_P(n\gamma_n^2)$, respectively. We next deal
with $D_{n,2}$. Note that $n^{-1}D_{n,2}$ is bounded by

$\sqrt{s}\,\gamma_na_n\|v\| + \gamma_n^2b_n\|v\|^2 = C\gamma_n^2(\sqrt{s} + b_nC)$,

by the Taylor expansion and the Cauchy-Schwarz inequality. As $b_n \to 0$,
the second term on the right-hand side of (5.12) dominates $D_{n,2}$ as well as
the first term on the right-hand side of (5.12), provided C is taken to be
sufficiently large. Hence, (5.7) holds for sufficiently large C. This completes
the proof of the theorem. □

To prove Theorem 2, we need the following lemma. 

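Lemma 3. Under the conditions of Theorem 2, with probability tending
to 1, for any given $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_P(n^{-1/2})$ and any constant
C, we have

$\mathcal{L}_P\{(\beta_1^T, 0^T)^T\} = \max_{\|\beta_2\| \le Cn^{-1/2}}\mathcal{L}_P\{(\beta_1^T, \beta_2^T)^T\}$.

Proof. We will show that, with probability tending to 1, as $n \to \infty$, for
any $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_P(n^{-1/2})$ and $\beta_2$ satisfying $\|\beta_2\| \le Cn^{-1/2}$,
$\partial\mathcal{L}_P(\beta)/\partial\beta_j$ and $\beta_j$ have different signs for $\beta_j \in (-Cn^{-1/2}, Cn^{-1/2})$, for
j = s + 1, ..., d. Thus, the maximum is attained at $\beta_2 = 0$. It follows by an
argument similar to the proof of Theorem 1 that

$\ell_j'(\beta) \equiv \frac{\partial\ell(\hat\alpha, \beta)}{\partial\beta_j} = n\left\{\frac{1}{n}\sum_{i=1}^{n}\Omega_j(U_i, Y_i, Z_i) - (\beta - \beta_0)^TB_j + o_P(n^{-1/2})\right\}$,

where $\Omega_j(U_i, Y_i, Z_i)$ is the jth element of $\Omega(U_i, Y_i, Z_i)$ and $B_j$ is the jth
column of B. Note that $\|\beta - \beta_0\| = O_P(n^{-1/2})$ by the assumption. Thus,
$n^{-1}\ell_j'(\beta)$ is of the order $O_P(n^{-1/2})$. Therefore, for $\beta_j \neq 0$ and j = s + 1, ..., d,

$\frac{\partial\mathcal{L}_P(\beta)}{\partial\beta_j} = \ell_j'(\beta) - np_{\lambda_n}'(|\beta_j|)\operatorname{sgn}(\beta_j) = -n\lambda_n\left\{\lambda_n^{-1}p_{\lambda_n}'(|\beta_j|)\operatorname{sgn}(\beta_j) + O_P\left(\frac{1}{\sqrt{n}\lambda_n}\right)\right\}$.

Since $\liminf_{n \to \infty}\liminf_{\beta_j \to 0+}\lambda_n^{-1}p_{\lambda_n}'(|\beta_j|) > 0$ and $\sqrt{n}\lambda_n \to \infty$, the sign of
the derivative is completely determined by that of $\beta_j$. This completes the
proof. □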
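The key requirement used in this proof, $\liminf_{n}\liminf_{\theta \to 0+}\lambda_n^{-1}p_{\lambda_n}'(\theta) > 0$, can be checked directly for common penalties. The calculation below assumes the standard SCAD derivative of Fan and Li (constant and equal to $\lambda$ on $(0, \lambda]$); the ridge-type penalty is included only as a contrasting example and is not discussed in the proof.

```latex
\begin{align*}
\text{SCAD:}\quad & p_{\lambda}'(\theta) = \lambda \ \text{ for } 0 < \theta \le \lambda
  && \Longrightarrow\quad \liminf_{\theta\to 0+}\lambda^{-1}p_{\lambda}'(\theta) = 1 > 0, \\
L_1:\quad & p_{\lambda}(\theta) = \lambda|\theta|,\ \ p_{\lambda}'(\theta) = \lambda
  && \Longrightarrow\quad \liminf_{\theta\to 0+}\lambda^{-1}p_{\lambda}'(\theta) = 1 > 0, \\
\text{ridge-type:}\quad & p_{\lambda}(\theta) = \lambda\theta^{2},\ \ p_{\lambda}'(\theta) = 2\lambda\theta
  && \Longrightarrow\quad \liminf_{\theta\to 0+}\lambda^{-1}p_{\lambda}'(\theta) = 0.
\end{align*}
```

A penalty whose derivative vanishes at the origin, such as the quadratic one, does not satisfy the condition, so the sign argument above does not apply to it.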

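Proof of Theorem 2. From Lemma 3, it follows that $\hat\beta_2 = 0$. We
next establish the asymptotic normality of $\hat\beta_1$. Let $\hat\theta = \sqrt{n}(\hat\beta_1 - \beta_{10})$, $\hat m_{i1} =
\hat\alpha^T(U_i)X_i + Z_{i1}^T\beta_{10}$ and $m_{i1} = \alpha_0^T(U_i)X_i + Z_{i1}^T\beta_{10}$. Then $\hat\theta$ maximizes

(5.13)  $\sum_{i=1}^{n} [Q\{g^{-1}(\hat m_{i1} + n^{-1/2}Z_{i1}^T\theta), Y_i\} - Q\{g^{-1}(\hat m_{i1}), Y_i\}] - n\sum_{j=1}^{s} p_{\lambda_n}(|\beta_{j0} + n^{-1/2}\theta_j|)$.

We consider the first term, say $\ell_{n1}(\theta)$. By means of a Taylor expansion, we
have

$\ell_{n1}(\theta) = n^{-1/2}\sum_{i=1}^{n} q_1(\hat m_{i1}, Y_i)Z_{i1}^T\theta + \frac{1}{2}\theta^TB_{n1}\theta$,

where $B_{n1} = \frac{1}{n}\sum_{i=1}^{n}\rho_2\{g^{-1}(\hat m_{i1} + \zeta_{ni})\}Z_{i1}Z_{i1}^T$, with $\zeta_{ni}$ between 0 and
$n^{-1/2}Z_{i1}^T\theta$, independent of $Y_i$. It can be shown that

(5.14)  $B_{n1} = -E\rho_2\{\alpha_0^T(U)X + Z_1^T\beta_{10}\}Z_1Z_1^T + o_P(1) = -B_1 + o_P(1)$.

A similar proof for (5.12) yields that

$\ell_{n1}(\theta) = n^{-1/2}\theta^T\sum_{i=1}^{n}\Omega_1(U_i, Y_i, Z_{i1}) - \frac{1}{2}\theta^TB_1\theta + o_P(1)$,

where $\Omega_1(U_i, Y_i, Z_{i1}) = q_1(m_{i1}, Y_i)Z_{i1} - \Gamma_1(U_i)$. By the Convexity Lemma
(Pollard [19]), we have that

$(B_1 + \Sigma_\lambda)\hat\theta + n^{1/2}b_n = n^{-1/2}\sum_{i=1}^{n}\Omega_1(U_i, Y_i, Z_{i1}) + o_P(1)$.

The conclusion follows as claimed. □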

Proof of Theorem 3. Decompose $\mathcal{R}(H_1) - \mathcal{R}(H_0) = I_{n,1} + I_{n,2} + I_{n,3}$,
where

$I_{n,1} = \sum_{i=1}^{n} [Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T\hat\beta), Y_i\} - Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T\beta_0), Y_i\}]$,

$I_{n,2} = -\sum_{i=1}^{n} [Q\{g^{-1}(Z_i^T\bar\beta), Y_i\} - Q\{g^{-1}(Z_i^T\beta_0), Y_i\}]$,

$I_{n,3} = \sum_{i=1}^{n} [Q\{g^{-1}(\hat\alpha^T(U_i)X_i + Z_i^T\beta_0), Y_i\} - Q\{g^{-1}(Z_i^T\beta_0), Y_i\}]$.

Using Theorem 10 of Fan, Zhang and Zhang [10], under $H_0$, we have

where $df_n \to \infty$ as $n \to \infty$. It suffices to show that $I_{n,1} = o_P(I_{n,3})$ and
$I_{n,2} = o_P(I_{n,3})$.

A direct calculation yields that

$I_{n,1} = \sum_{i=1}^{n} q_1\{X_i^T\hat\alpha(U_i) + Z_i^T\beta_0, Y_i\}Z_i^T(\hat\beta - \beta_0)$
    $-\; \frac{1}{2}(\hat\beta - \beta_0)^T\sum_{i=1}^{n} Z_iZ_i^Tq_2\{g^{-1}(\hat\alpha(U_i)^TX_i + Z_i^T\beta_0)\}(\hat\beta - \beta_0) + o_P(1)$.

Using techniques related to those used in the proof of Theorem 2, we obtain

$\frac{1}{n}\sum_{i=1}^{n} Z_iZ_i^Tq_2[g^{-1}\{\hat\alpha(U_i)^TX_i + Z_i^T\beta_0\}] = B + o_P(1)$,

$\sum_{i=1}^{n} q_1\{\hat\alpha(U_i)^TX_i + Z_i^T\beta_0, Y_i\}Z_i = nB(\hat\beta - \beta_0) + o_P(1)$.

Thus,

$2I_{n,1} = n(\hat\beta - \beta_0)^TB(\hat\beta - \beta_0) + o_P(1) \to \chi_d^2$.

Under $H_0$, $-2I_{n,2}$ equals a likelihood ratio test statistic for $H_0: \beta = \beta_0$
versus $H_1: \beta \neq \beta_0$. Thus, under $H_0$, $-2I_{n,2} \to \chi_d^2$. Thus, $I_{n,1} = o_P(I_{n,3})$ and
$I_{n,2} = o_P(I_{n,3})$. This completes the proof. □